
    Optimization of the Distributed I/O Subsystem of the k-Wave Project

    This thesis deals with an efficient solution for parallel writing of large volumes of data on the Lustre file system. The work will be used by the k-Wave project, designed for time-domain acoustic and ultrasound simulations. Since the simulation is computationally and data intensive, the project has to be implemented with libraries for parallel computing (Open MPI) and large data storage (HDF5), and it must run on a supercomputer. The application is implemented in C and uses the mentioned libraries. Proper settings of the Lustre file system lead to a peak write bandwidth of 2.5 GB/s, which corresponds to a speedup factor of 5 compared to the reference settings. Data aggregation improved the write bandwidth by a factor of 3 compared to a naive version; for certain block sizes the achieved I/O bandwidth hits the limits of the Anselm I/O subsystem (3 GB/s).
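
    The thesis's code is not reproduced in this listing, but the approach the abstract describes, collective parallel HDF5 writes over MPI-IO with Lustre striping hints, can be sketched roughly as follows. The stripe values, dataset shape and file name are illustrative assumptions, not the actual configuration used on Anselm.

        /* Sketch: collective parallel HDF5 write over MPI-IO with Lustre striping
         * hints. Stripe values, dataset size and file name are illustrative only. */
        #include <mpi.h>
        #include <hdf5.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* ROMIO hints that map to Lustre striping (assumed values). */
            MPI_Info info;
            MPI_Info_create(&info);
            MPI_Info_set(info, "striping_factor", "16");      /* number of OSTs */
            MPI_Info_set(info, "striping_unit",   "4194304"); /* 4 MiB stripes  */

            hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
            H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
            hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

            /* Each rank owns one contiguous block of a 1-D dataset. */
            hsize_t local  = 1 << 22;                /* 4 Mi doubles per rank */
            hsize_t global = local * nprocs;
            hsize_t offset = local * rank;
            hid_t filespace = H5Screate_simple(1, &global, NULL);
            hid_t memspace  = H5Screate_simple(1, &local, NULL);
            hid_t dset = H5Dcreate2(file, "p", H5T_NATIVE_DOUBLE, filespace,
                                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
            H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &local, NULL);

            double *buf = malloc(local * sizeof(double));
            for (hsize_t i = 0; i < local; i++) buf[i] = (double)rank;

            /* Collective data transfer is what lets MPI-IO aggregate the writes. */
            hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
            H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
            H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

            free(buf);
            H5Pclose(dxpl); H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
            H5Fclose(file); H5Pclose(fapl); MPI_Info_free(&info);
            MPI_Finalize();
            return 0;
        }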

    Optimization of the Distributed I/O Subsystem of the k-Wave Project

    This thesis deals with an efficient solution for the parallel I/O of the k-Wave tool, which is designed for time-domain acoustic and ultrasound simulations. k-Wave is a supercomputer application: it runs on the Lustre file system, it has to be implemented with MPI, and it stores its data in a format suitable for large data volumes (HDF5). I designed three optimization methods that fit k-Wave's needs, based on data accumulation and redistribution techniques. Compared with the native write, every optimization method improved the write speed, up to 13.6 GB/s. These methods can be used to optimize any application with distributed data and frequent writes.
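
    The three methods themselves are not described in this listing; the following is only a minimal sketch of the underlying accumulation idea, in which a few aggregator ranks collect the blocks of many compute ranks before touching the file system. The 1:4 aggregator ratio, the buffer size and the omission of the actual write call are assumptions made for brevity.

        /* Sketch: accumulate data from many compute ranks onto a few aggregator
         * ranks before writing. The 1:4 aggregator ratio and buffer sizes are
         * illustrative assumptions, not the thesis's actual parameters. */
        #include <mpi.h>
        #include <stdlib.h>

        #define AGGR_RATIO 4          /* one aggregator per 4 ranks (assumed) */

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* Group ranks so that each group shares one aggregator (local rank 0). */
            int group = rank / AGGR_RATIO;
            MPI_Comm group_comm;
            MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);
            int grank, gsize;
            MPI_Comm_rank(group_comm, &grank);
            MPI_Comm_size(group_comm, &gsize);

            /* Every rank produces a local block; aggregators collect the blocks. */
            const int local_n = 1 << 20;
            float *local_buf = malloc(local_n * sizeof(float));
            for (int i = 0; i < local_n; i++) local_buf[i] = (float)rank;

            float *agg_buf = NULL;
            if (grank == 0)
                agg_buf = malloc((size_t)local_n * gsize * sizeof(float));

            MPI_Gather(local_buf, local_n, MPI_FLOAT,
                       agg_buf,   local_n, MPI_FLOAT, 0, group_comm);

            if (grank == 0) {
                /* Only aggregators touch the file system, writing one large,
                 * contiguous block instead of many small ones (write call omitted). */
            }

            free(local_buf); free(agg_buf);
            MPI_Comm_free(&group_comm);
            MPI_Finalize();
            return 0;
        }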

    MERIC and RADAR generator: tools for energy evaluation and runtime tuning of HPC applications

    This paper introduces two tools for manual energy evaluation and runtime tuning developed at IT4Innovations in the READEX project. The MERIC library can be used for manual instrumentation and analysis of any application from the energy and time consumption point of view. Besides tracing, MERIC can also change environment and hardware parameters during the application runtime, which leads to energy savings. MERIC stores large amounts of data, which are difficult to read by a human. The RADAR generator analyses the MERIC output files to find the best settings of the evaluated parameters for each instrumented region. It generates a report and a MERIC configuration file for application production runs.
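
    As an illustration of what manual instrumentation with MERIC looks like, the sketch below marks two regions of an MPI application. The header and function names (meric.h, MERIC_Init, MERIC_MeasureStart, MERIC_MeasureStop, MERIC_Close) are quoted from memory of the READEX documentation and should be verified against the installed MERIC headers; the regions and the loop are purely illustrative.

        /* Sketch of manual region instrumentation in the MERIC style. The header
         * and function names are assumptions (see the note above) and the two
         * regions are placeholders for real application phases. */
        #include <mpi.h>
        #include <meric.h>                      /* assumed header name */

        static void compute_step(void) { /* application kernel (stub) */ }
        static void io_step(void)      { /* application I/O (stub)    */ }

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            MERIC_Init();                       /* start energy/time measurement */

            for (int iter = 0; iter < 100; iter++) {
                MERIC_MeasureStart("compute");  /* instrumented region "compute" */
                compute_step();
                MERIC_MeasureStop();

                MERIC_MeasureStart("io");       /* instrumented region "io"      */
                io_step();
                MERIC_MeasureStop();
            }

            MERIC_Close();                      /* flush data for the RADAR generator */
            MPI_Finalize();
            return 0;
        }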

    Domain knowledge specification for energy tuning

    To overcome the challenges of energy consumption of HPC systems, the European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy-efficient Exascale computing) project uses an online auto-tuning approach to improve the energy efficiency of HPC applications. The READEX methodology pre-computes optimal system configurations at design time, such as the CPU frequency, for instances of program regions, and at runtime switches to the configuration given in the tuning model whenever the region is executed. READEX goes beyond previous approaches by exploiting dynamic changes of a region's characteristics through region- and characteristic-specific system configurations. While the tool suite supports an automatic approach, specifying domain knowledge such as the structure and characteristics of the application and the application tuning parameters can significantly help to create a more refined tuning model. This paper presents the means available for an application expert to provide domain knowledge and reports tuning results for several benchmarks.
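
    The design-time/runtime split can be illustrated with a small sketch that is not the READEX implementation: a static table plays the role of the tuning model, and a placeholder function stands in for the mechanism that actually applies a configuration (READEX does this through its runtime library and the system's DVFS interfaces). Region names and frequency values are made up.

        /* Minimal sketch of switching pre-computed system configurations at
         * region entry. The tuning-model contents and apply_core_frequency()
         * are hypothetical placeholders, not the READEX implementation. */
        #include <stdio.h>
        #include <string.h>

        struct region_config {
            const char *region;     /* instrumented region name                 */
            int core_freq_mhz;      /* CPU core frequency chosen at design time */
            int uncore_freq_mhz;    /* uncore frequency chosen at design time   */
        };

        /* Design-time tuning model: best configuration per region (made-up values). */
        static const struct region_config tuning_model[] = {
            { "dense_blas",  2500, 2700 },  /* compute bound: high core frequency */
            { "sparse_blas", 2100, 2700 },  /* memory bound: lower core frequency */
            { "mpi_comm",    1800, 2200 },  /* communication: lowest frequencies  */
        };

        static void apply_core_frequency(int core_mhz, int uncore_mhz)
        {
            /* Placeholder: a real implementation would use a DVFS interface
             * (cpufreq, MSRs, ...) which requires appropriate privileges. */
            printf("switching to %d MHz core / %d MHz uncore\n", core_mhz, uncore_mhz);
        }

        /* Called on entry of every instrumented region at runtime. */
        static void on_region_enter(const char *region)
        {
            for (size_t i = 0; i < sizeof tuning_model / sizeof tuning_model[0]; i++)
                if (strcmp(tuning_model[i].region, region) == 0)
                    apply_core_frequency(tuning_model[i].core_freq_mhz,
                                         tuning_model[i].uncore_freq_mhz);
        }

        int main(void)
        {
            on_region_enter("dense_blas");
            on_region_enter("mpi_comm");
            return 0;
        }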

    Domain Knowledge Specification for Energy Tuning

    The European Horizon 2020 project READEX is developing a tool suite for dynamic energy tuning of HPC applications. While the tool suite supports an automatic approach, domain knowledge can significantly help in the analysis and runtime tuning phases. This paper presents the means available in READEX for the application expert to provide such knowledge to the tool suite.

    Application instrumentation for performance analysis and tuning with focus on energy efficiency

    Profiling and tuning of parallel applications is an essential part of HPC. Analysis and elimination of application hot spots can be performed using many available tools, which also provide resource consumption measurements for instrumented parts of the code. Since complex applications show different behavior in each part of the code, it is essential to be able to insert instrumentation into these parts. Because each performance analysis or autotuning tool can bring different insights into application behavior, it is valuable to analyze and optimize an application using a variety of them. We present a shared C/C++ API, inserted on request, for the most common open-source HPC performance analysis tools, which simplifies the process of manual instrumentation. Besides manual instrumentation, profiling libraries provide other methods of instrumentation. Of these, binary patching is the most universal mechanism, and it greatly improves the user-friendliness and robustness of a tool. We provide an overview of the most commonly used binary patching tools and describe a workflow for using them to implement a binary instrumentation tool for any profiler or autotuner. We have also evaluated the minimum overhead of manual and binary instrumentation.
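
    A unified begin/end API of the kind described above can be sketched as a thin header that dispatches to a profiler chosen at compile time. The Score-P user macros in the first branch are taken from Score-P's user-instrumentation interface and should be checked against the installed headers; the fallback branch only prints wall-clock times and does not support nested regions.

        /* Sketch of a unified begin/end instrumentation API with a compile-time
         * selected backend. Build with -DUSE_SCOREP to dispatch to Score-P's
         * user instrumentation; otherwise a simple timing fallback is used. */
        #ifndef INSTRUMENT_H
        #define INSTRUMENT_H

        #if defined(USE_SCOREP)
          #include <scorep/SCOREP_User.h>
          #define REGION_BEGIN(name) \
              SCOREP_USER_REGION_BY_NAME_BEGIN(name, SCOREP_USER_REGION_TYPE_COMMON)
          #define REGION_END(name) \
              SCOREP_USER_REGION_BY_NAME_END(name)
        #else
          /* Fallback backend: plain wall-clock timing printed to stderr. */
          #include <stdio.h>
          #include <time.h>
          static double instr_now(void)
          {
              struct timespec ts;
              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec + ts.tv_nsec * 1e-9;
          }
          static double instr_start_time;
          #define REGION_BEGIN(name) (instr_start_time = instr_now())
          #define REGION_END(name) \
              fprintf(stderr, "%s: %.6f s\n", (name), instr_now() - instr_start_time)
        #endif

        #endif /* INSTRUMENT_H */

    An application then brackets its phases with REGION_BEGIN("solver") and REGION_END("solver") and keeps the same source regardless of which analysis backend was selected at build time.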

    DGX-A100 face to face DGX-2: performance, power and thermal behavior evaluation

    Nvidia is a leading producer of GPUs for high-performance computing and artificial intelligence, bringing top performance and energy efficiency. We present a performance, power consumption, and thermal behavior analysis of the new Nvidia DGX-A100 server equipped with eight A100 Ampere-microarchitecture GPUs. The results are compared against the previous generation of the server, the Nvidia DGX-2, based on Tesla V100 GPUs. We developed a synthetic benchmark to measure the raw performance of floating-point computing units, including Tensor Cores. Furthermore, thermal stability was investigated. In addition, a Dynamic Voltage and Frequency Scaling (DVFS) analysis was performed to determine the most energy-efficient configuration of the GPUs executing workloads of various arithmetic intensities. Under the energy-optimal configuration the A100 GPU reaches an efficiency of 51 GFLOPS/W for a double-precision workload and 91 GFLOPS/W for a Tensor Core double-precision workload, which makes the A100 the most energy-efficient server accelerator for scientific simulations on the market.
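
    The DVFS analysis relies on locking the GPU to a chosen application clock and measuring power while a workload runs. The paper's benchmark is not reproduced here; the sketch below only shows the NVML calls such a measurement loop would revolve around. The clock values are placeholders and error handling is minimal.

        /* Sketch: set an application clock on GPU 0 and sample its power draw via
         * NVML. Clock values are placeholders; a real sweep would iterate over
         * nvmlDeviceGetSupportedMemoryClocks/GraphicsClocks and run a workload
         * between samples. Link with -lnvidia-ml. */
        #include <stdio.h>
        #include <unistd.h>
        #include <nvml.h>

        int main(void)
        {
            nvmlDevice_t dev;
            if (nvmlInit() != NVML_SUCCESS) return 1;
            nvmlDeviceGetHandleByIndex(0, &dev);

            /* Lock application clocks: memory clock, then graphics (SM) clock in MHz.
             * Changing clocks usually requires root or suitable permissions. */
            unsigned int mem_mhz = 1215, sm_mhz = 1095;   /* placeholder values */
            if (nvmlDeviceSetApplicationsClocks(dev, mem_mhz, sm_mhz) != NVML_SUCCESS)
                fprintf(stderr, "could not set application clocks\n");

            /* Sample power (reported in milliwatts) while a workload runs elsewhere. */
            for (int i = 0; i < 10; i++) {
                unsigned int mw = 0;
                nvmlDeviceGetPowerUsage(dev, &mw);
                printf("sample %d: %.1f W\n", i, mw / 1000.0);
                sleep(1);
            }

            nvmlDeviceResetApplicationsClocks(dev);
            nvmlShutdown();
            return 0;
        }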

    Evaluation of the HPC applications dynamic behavior in terms of energy consumption

    This paper introduces the READEX project tuning approach, which exploits the dynamic behavior of an application and its potential for energy savings. The paper focuses on the manual evaluation of applications from the energy consumption optimization point of view. As examples, we have selected one complex application, the ESPRESO library, and two simplified applications from the ProxyApps benchmark suite. ESPRESO contains many types of operations, including I/O, communication, sparse BLAS and dense BLAS. The results show static savings of 5.6–12.3% and dynamic savings of 4.7–9.1%. The highest total savings for ESPRESO are 21.4%, as a combination of 12.3% static savings and 9.1% dynamic savings. The ProxyApps applications, Kripke and Lulesh, were evaluated in two configurations each. The first Kripke configuration saved 29.3% of energy, almost entirely by static tuning. The second configuration shows only 18.8% savings, but over a third of that was achieved by dynamically switching the CPU core and uncore frequencies. The Lulesh test cases saved 28.9% and 26.7%, respectively.
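
    The reported numbers add up exactly (12.3% + 9.1% = 21.4%), which is consistent with the static and dynamic savings both being expressed relative to the same default-configuration baseline. Under that assumption the quantities combine as

        S_{\mathrm{static}}  = \frac{E_{\mathrm{default}} - E_{\mathrm{static}}}{E_{\mathrm{default}}}, \qquad
        S_{\mathrm{dynamic}} = \frac{E_{\mathrm{static}} - E_{\mathrm{dynamic}}}{E_{\mathrm{default}}}, \qquad
        S_{\mathrm{total}}   = S_{\mathrm{static}} + S_{\mathrm{dynamic}}
                             = \frac{E_{\mathrm{default}} - E_{\mathrm{dynamic}}}{E_{\mathrm{default}}}

    where E_default, E_static and E_dynamic denote the energy consumed with the default settings, with the best single static configuration, and with per-region dynamic switching on top of it.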

    A massively parallel and memory-efficient FEM toolbox with a hybrid total FETI solver with accelerator support

    In this article, we present the ExaScale PaRallel finite element tearing and interconnecting SOlver (ESPRESO) finite element method (FEM) library, which includes an FEM toolbox with interfaces to professional and open-source simulation tools, and a massively parallel hybrid total finite element tearing and interconnecting (HTFETI) solver which can fully utilize the Oak Ridge Leadership Computing Facility Titan supercomputer and achieve superlinear scaling. This article presents several new techniques for finite element tearing and interconnecting (FETI) solvers designed for efficient utilization of supercomputers, with a focus on (i) performance: we present a fivefold reduction of solver runtime for the Laplace equation by redesigning the FETI solver and offloading the key workload to the accelerator, and we compare Intel Xeon Phi 7120p and Tesla K80 and P100 accelerators to Intel Xeon E5-2680v3 and Xeon Phi 7210 central processing units; and (ii) memory efficiency: we present two techniques which increase the efficiency of the HTFETI solver 1.8 times and push the limits of the largest problem ESPRESO can solve from 124 to 223 billion unknowns for problems with unstructured meshes. Finally, we show that by dynamically tuning hardware parameters, we can reduce energy consumption by up to 33%.
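
    The abstract does not state which kernel is offloaded; as an assumption, dense per-subdomain operators (such as local Schur complements) applied in every solver iteration are a typical candidate in FETI-type methods, and their offload to a GPU with cuBLAS can be sketched as follows. The matrix size and contents are placeholders.

        /* Sketch: offload a dense matrix-vector product to the GPU with cuBLAS.
         * In HTFETI-type solvers a dense per-subdomain operator applied many
         * times per iteration is a typical offload candidate; the matrix here
         * is a random placeholder. Build e.g. with: nvcc offload.c -lcublas */
        #include <stdio.h>
        #include <stdlib.h>
        #include <cuda_runtime.h>
        #include <cublas_v2.h>

        int main(void)
        {
            const int n = 2048;                      /* placeholder subdomain size */
            double *A = malloc((size_t)n * n * sizeof(double));
            double *x = malloc(n * sizeof(double));
            double *y = malloc(n * sizeof(double));
            for (size_t i = 0; i < (size_t)n * n; i++) A[i] = rand() / (double)RAND_MAX;
            for (int i = 0; i < n; i++) x[i] = 1.0;

            double *dA, *dx, *dy;
            cudaMalloc((void **)&dA, (size_t)n * n * sizeof(double));
            cudaMalloc((void **)&dx, n * sizeof(double));
            cudaMalloc((void **)&dy, n * sizeof(double));
            cudaMemcpy(dA, A, (size_t)n * n * sizeof(double), cudaMemcpyHostToDevice);
            cudaMemcpy(dx, x, n * sizeof(double), cudaMemcpyHostToDevice);

            cublasHandle_t handle;
            cublasCreate(&handle);
            const double alpha = 1.0, beta = 0.0;
            /* y = A * x on the device; in a solver this call sits inside the
             * iteration loop while A stays resident on the GPU. */
            cublasDgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);
            cudaMemcpy(y, dy, n * sizeof(double), cudaMemcpyDeviceToHost);

            printf("y[0] = %f\n", y[0]);
            cublasDestroy(handle);
            cudaFree(dA); cudaFree(dx); cudaFree(dy);
            free(A); free(x); free(y);
            return 0;
        }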